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Abstract 

Background: Diagnostic models are frequently used to assess the role of risk factors on disease complications, and 
therefore to avoid them. Missing data is an issue that challenges the model making. The aim of this study was to 
develop a diagnostic model to predict death in HIV/ AIDS patients when missing data exist. 

Methods: HIV patients (n=1460) referred to Voluntary Consoling and Testing Center (VCT) of Shiraz southern Iran 
during 2004-2009 were recruited. Univariate association between variables and death was assessed. Only variables 
which had univariate P< 0.25 were selected to be offered to the Multifactorial models. First, patients with missing data 
on candidate variables were deleted (C-C model). Then, applying Multivariable Imputation via Chained Equations 
(MICE), missing data were imputed. Logistic regression was fitted to C-C and imputed data sets (MICE model). Models 
were compared in terms of number of variables retained in the final model, width of confidence intervals, and 
discrimination ability. 

Result: About 22% of data were lost in C-C model. Number of variables retained in the C-C and MICE models was 2 
and 6 respectively. Confidence Intervals (C.I.) corresponding to C-C model was wider than that of MICE. The MICE 
model showed greater discrimination ability than C-C model (70% versus 64%). 

Conclusion: The C-C analysis resulted to loss of power and wide CI's. Once missing data were imputed, more variables 
reached significance level and C.I.'s were narrower. Therefore, we do recommend the application of the imputation 
method for handling missing data. 
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Introduction 

In epidemiological studies, researchers usually 
try to assess the associations between several 
predictors and an outcome variable using multi- 
variable regression techniques. Once such 
predictors are identified, through diagnostic 
models, they will be combined to get a risk 
score. The estimated risk scores are then con- 
verted to estimate the risk of a disease related 
event. In other words, diagnostic models enable 
clinicians to estimate the risk of a disease-re- 



lated event (such as death due to HIV/ AIDS) 
by combinations of multiple predictor values. 
This helps the management of future patients. 
AIDS is a disease of the human immune system 
caused by the human immunodeficiency virus. 
AIDS is now a pandemic (1). In 2008, an esti- 
mated 35,000 people in the Middle East and 
North Africa became infected with HIV, and 
20,000 AIDS-related deaths occurred (2). Until 
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this year 15% of people who diagnosis with 
HIV/AIDS in Iran died. 

Therefore, identification of risk factors to be 
used so as to predict risk of death for patients is 
necessary. This makes physicians enable to 
identify high risk patients who need special 
care and treatment. 

However, one of the issues that challenge the 
process of model development is missing data. 
Missing data is a common problem in clinical 
data sets (3). In the content of regression 
modeling, most software by default exclude 
every subject from the analysis with at least one 
missing value on any of the predictors or out- 
come analyzed (know as Complete Case (C-C) 
analysis), and offer those with available data to 
the regression model (4). However, usefulness 
of the C-C analysis depends on the rate and the 
mechanism of missing data (see rest of the 
text). In addition these circumstances rarely oc- 
cur in practice. Here we briefly reviewed the 
missing data mechanisms. Three main missing 
mechanism are Missing Completely At Ran- 
dom (MCAR), Missing Not At Random 
(MNAR), and Missing At Random (MAR) (4). 
Missing data are categorized as MCAR when 
subjects with missing data are a random sample 
of data (5). For example, MCAR occurs when a 
blood sample tube is broken or when a 
questionnaire is accidentally lost. Categoriza- 
tion of missing data as MNAR applies when the 
probability that an observation is missing is re- 
lated to unobserved information, such as the 
actual (missing) value of the variable (5). This 
can happen, for example, when a patient is so 
sick that a medical procedure cannot be applied 
to measure a study variable. However, missing 
data are usually neither MCAR nor MNAR but 
MAR (6). MAR applies when the probability 
that an observation is missing is related to other 
observed patient characteristics (4). 
The C-C method seems reasonable under the 
MCAR assumption if less than five percent of 
entire data set is missing (7). However, when 
missing rate is high, even under MCAR mecha- 
nism, exclusion of missing data will diminish 



precision of estimates (4,8). Under other miss- 
ing mechanisms C-C method leads to biased 
estimates (9). 

Various methods have been proposed to deal 
with missing data. There are modern likelih- 
ood-based imputation methods (such as 
Multivariate Imputation via Chained Equations 
(MICE) which recover the missing data avoid- 
ing waste of information. 
The aim of this study was to address the impact 
of exclusion of patients with missing data o 
model composition, and estimated Confidence 
Intervals (C.I.). To do this, we used a set of 
HIV data set as an example 

Materials and Methods 

Sample and outcome 

Making confidential and voluntary counseling 
and testing (VCT) services available has proven 
to encourage people to determine their HIV sta- 
tus (2). In this study we used information on 
1460 infected person with HIV during 2004- 
2009, which were referred to VCT center of 
Shiraz. The main outcome of study was death 
due to HIV. 

Statistical methods 
Variables selected 

Information on a large umber of variables was 
available. However, EPV rule recommends that 
at least ten Events Per independent Variable 
being tested are required (10). Therefore, candi- 
date variables for the multifactorial modeling 
were selected through a series of univariate 
logistic regression analysis. Only variables with 
univariate P-value less than 0.25 were selected 
to be offered to the multifactorial models (11). 

Complete-Case model (C-C model) 
In the C-C model, patients with missing data on 
any of variables selected in the screening round 
were excluded. A logistic regression model in 
conjunction with Backward Elimination (B.E.) 
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variable selection method was then fitted to the 
data. 

Imputation model (MICE model) 
Details of the Multivariable Imputation via 
Chained Equations (MICE) method are pre- 
sented elsewhere (12). Briefly a series of 
regression models are run whereby missing 
value on each variable is predicted upon the 
other variables in the data. The MICE method 
takes into account the uncertainty regarding by 
replace each missing value by multiple imputed 
values, (usually 10) (6, 13). 
It has been noted that by including enough va- 
riables in the imputation model (10 to 15) the 
MAR assumption would be more plausible (14). 
Therefore, all variables, regardless of their uni- 
variate P-value, and patient's outcome were of- 
fered to the imputation model (15). 
As candidate variables for multifactorial model- 
ing was dichotomous, logistic regression model 
was used to impute missing data 10 times. Esti- 
mates derived from 10 imputed data sets (the 
coefficients and standard errors) were com- 
bined applying Rubin's rule (16). Then the vari- 
able with the highest P- value (if >0.05) was 
excluded, and new coefficients and SE's were 
estimated for the rest of variables. This process 
was applied iteratively until all variables remain 
significant in all imputed data sets (analogous 
to B.E. variable selection method in C-C analy- 
sis) (14). Odds Ratios (OR) and corresponding 
95% Confidence Intervals (C.I.) were calcu- 
lated from regression coefficients and standard 
errors that have been imputed across multiply 
imputed data sets. 

Comparison of C-C and MICE models 
Models developed were compared in terms of 
variables contributed to the multifactorial mod- 
els, estimated Standard Errors (S.E.), estimated 
Odds Ratio's (OR) and associated Confidence 
Intervals (C.I.). In addition the probability of 
death was estimated from logistic models 
developed. This was done for both C-C and im- 
puted data sets. The estimated probabilities 



were then compared with the observed patients' 
outcome to compare the models in terms of the 
Area under ROC Curve (AUC). 

Software 

Complete -Case model was fitted using SPSS 
(Version 15) software. Missing data were im- 
puted using MICE package which works under 
R software. 

Results 

Descriptive statistics for variables selected are 
presented in Table 1. Out of 1460 infected 
cases 283 deaths were observed, giving a death 
rate of 9.4%. Approximately forth-fifth of 
cases were from urban areas. Nearly 86% of 
sample was male. Furthermore, about 75% of 
subjects had history of injection, and 23% of 
patients have at least one symptom of HIV. 
Infection with HCV was as high as 65%. 
In a screening process, univariate association 
between 9 variables and death was confirmed at 
a 0.25 significance level. These variables, 
which are selected to be offered to the multifac- 
torial model, are listed in Table 1. For each 
variable, P-value, Odds Ratio (OR) and corres- 
ponding CI in univariate logistic regression 
analysis is presented. In addition, number of 
patients with available data is reported. 
The HCV status of nearly 15% of patients was 
not available (Table 1). Although missing rate 
for the rest of candidate variables was less than 
1%, by exclusion of missing data, the sample 
size was reduced to 1134. This indicates loss of 
about 22% of data. 

Out of 9 variables offered to the C-C model 
only 2 remained significant in the multifactorial 
analysis (Table 2). Those who had history of 
prison, relative to those who did no were at 
higher risk of death with OR= 3.55 (95% C.I: 
2.02, 5.73). In addition OR of death for patients 
with HIV symptoms relative to those without 
were about 3.17 (95% C.I: 2.32, 4.33). 
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Once missing data were imputed, 4 more vari- 
ables retained significant in the multifactorial 
model (6 in the MICE versus 2 in the C-C mod- 
els). This was due to the recovery of informa- 
tion and gain in power. These 4 variables were 
having AIDS syndrome, place of residence, sex, 
and history of surgery. In particular we found 
that risk of death for those with AIDS syn- 
dromes was about 5 times higher than others 
(OR (95% C.I.): 4.71 (2.15, 10.30)). Further- 
more, being female was associated with more 
than 55% reduction in the risk of death. 
As models were developed in conjunction with 
B.E. variable selection method, estimated S.E.'s 
and P-values corresponding to the variables 
reached significance level in the MICE model 
but not in the C-C model are not reported in the 
table. This is because those variables were not 
retained in the final C-C model However, for 
the sake of comparison, estimated (SE, P-value) 



in the first step of C-C and last step of the 
MICE models for those variables are given; 
(0.6, 0.07) and (0.43, <0.001) for having AIDS 
syndrome; (0.21, 0.16) versus (0.18, 0.02) for 
the place of residence, (0.47, 0.10) versus (0.37, 
0.02) for sex, and (0.23, 0.07) versus (0.21, 
0.02) for history of surgery. 
As expected, estimated SE's corresponding to 
the MICE model was always smaller than that 
of the C-C model. In addition, these four vari- 
ables were far from being significant in the C-C 
model but retained in the final model once the 
MICE imputation method was applied. 
We also compared models developed in terms 
of discrimination ability. AUC for C-C model 
was 0.64. Corresponding figures for MICE 
model were 0.70. This indicates nearly 10% 
improvement in discrimination ability of model 
after imputation of missing data. 



Table 1: Descriptive statistics of variables selected for multifactorial analysis 



Variables 


category 


Frequency (%) 


% of cases 
with available 
data 


Univariate OR 
(CI) 


P-value 


Sex 


male 


86.3 


99.9 


0.35 (0.20, 0.59) 


0.0001 




female 


13.2 






History of 


NO 


22.9 


99.9 


2.44(1.64,3.62) 


0.0001 


Prison 


Yes 


77 








History of 


NO 


25.4 


99.9 


1.65 (1.18 ,2.31) 


0.003 


injection 


Yes 


74. 








AIDS Syn- 


NO 


97.7 


99.5 


4.71 (2.15, 


0.0001 


drome 


Yes 


1.8% 




10.30) 




With or with- 


WITH OUT 


73.3 


96.7 


2.95 (2.21,3.91) 


0.0001 


tuberculosis 


« rTTT T 

posnrvE 


1.5 


99.3 


0.27 (1.52,8.64) 


0.004 




NEGATIVE 


97.8 








HCV 


posnrvE 


65.5 


84.1 


1.99(1.33,3.01) 


0.001 




Negative 


18.6 








history of 


NO 


82.4 


99.2 


0.61 (0.41,0.91) 


0.016 


surgery 


Yes 


16.8 








Place of resi- 


Rural 


21.5 


99.6 


1.45 (1.04 ,2.04) 


0.03 


dence 


Urban 


78.1 
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Table 2: Using two kind of model (C-C and MICE) for 5 variables selected 



Variable 


Category 


C-C Model (n=1134) 




MICE Model (n 


=1460) 
























OR (95% C.I.) P- value 


S.E. 


OR (95% C.I.) 


P-value 


S.E. 


With or with- 


WITH OUT 














out HIV 


WITH 


3.17 (2.32,4.33) <0.001 


0.16 


3.09 


(2.31, 


<0.000 


0.15 


symptoms 








4.15) 




1 




Prison 


NO 
















Yes 


3.55 (2.02,5.73) <0.001 


0.25 


2.01 
3.62) 


(1.12, 


0.02 


0.27 


AIDS Syn- 


NO 














drome 


Yes 






4.53 
10.51) 


(1.95, 


<0.001 


0.43 


Place of resi- 


Rural 














dence 


Urban 






1.52 
2.16) 


(1.07, 


0.02 


0.18 


Sex 


male 
















female 






0.43 
0.89) 


(0.21, 


0.02 


0.37 


Surgery 


NO 
















Yes 






0.60 
0.91) 


(0.40, 


0.02 


0.21 



Discussion 



Diagnostic models provide valuable informa- 
tion regarding contribution of variables on 
clinical outcome (17). However, presence of 
missing data makes exercise of model building 
difficult. When missing data exist, and ex- 
cluded, power would be lost. Consequently, va- 
riables with real predictive ability may not con- 
tribute to the multifactorial model. Therefore, 
imputation is necessary. Imputation of missing 
data, relative to C-C analysis, leads to gain in 
power. 

Although understanding of the factors associ- 
ated with HIV/ AIDS death is vital, it should be 
noted that our main goal was not to develop the 
best possible model to predict death due to 
HIV/ AIDS. We did not have information of 
some known and important risk factors such as 
CD4 level. The main goal of our study was to 
demonstrate the impact of exclusion of missing 
data on model composition and its performance. 
It has been emphasize that exclusion of cases 
with missing data resulted in the loss of preci- 



sion and a wider confidence interval. This was 
the case in our analysis. In the C-C model only 
2 variables were retained in the model. How- 
ever, after imputation of missing data, and gain 
in power, 4 more variables were survived in the 
model (6 variables in total). We have seen that 
estimated S.E.'s in the MICE model were nar- 
rower which was due to recovery of informa- 
tion. 

Lots of important risk factors of death due to 
HIV/ AIDS are known. Some studies find a 
strong association between AIDS death rates 
and positive history of primary and secondary 
syphilis rates among men (18). Concurrent 
HIV/ AIDS was associated with more than twice 
the risk of HIV -related death within the 4 
months after diagnosis (19). In addition, injec- 
tion drug use was known to be associated with 
a substantially increased risk of morbidity and 
mortality (20,21). 

We found that patients who have symptom of 
HIV or AIDS syndrome are more at risk than 
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others. Regarding the place of residence, al- 
though metropolitan become more infected 
with HIV but rustic patient are at higher risk of 
death, possibly due to lack of facilities in rural 
areas. We also found that those used methadone, 
were at lower risk of death. This might be due 
to the fact that use of methadone, in comparison 
with illegal drugs, can increase longevity and 
compliance them to referred VCT for more care 
and decrease high risk behavioral on addicted 
patients. 

To address the impact of missing data on 
estimation of the treatment effect, a randomized 
clinical trial of antiretroviral therapy for HIV- 
infected individuals were carried out (22). It has 
been shown that exclusion of missing data re- 
sulted in the underestimation of the true treat- 
ment effect. This was due to selective dropout 
of participants with lower or decreasing CD4 
counts. 

Another approach frequently applied to tackle 
missing data is to perform missing indicator 
analysis. That is to put subjects with missing 
data into a separate category. This method has 
been applied analyzing data of health care 
workers who exposed with HIV-infected blood 
(23). Results of C-C and missing indicator 
were similar in terms of variables contributed to 
the models. However, estimated confidence 
intervals in the C-C model were wider. Results 
were not compared with that of the MICE 
model. 

In another study to identify factors associated 
with non adherence during the maintenance 
phase of HAART significant difference be- 
tween results of the C-C model and model in 
which missing data were imputed was seen (24). 
Although disadvantages on the C-C analysis are 
known, majority of studies prefer to exclude 
missing data (25), possibly due to its simplicity. 
The MICE model, on the other had, is a flexible 
approach to generate multivariate multiple 
imputations. We do recommend application of 
modem imputation methods before develop- 
ment of diagnostic models. 
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